Data exploration

Let's take a closer look at our data.

Find all NaN and inf values in our data.

All inf values most likely have the same meaning in the data as NaN, so let's replace every inf with NaN.
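A minimal sketch of this step, assuming the dataset lives in a pandas DataFrame called `df` (the toy values here are hypothetical):

```python
import numpy as np
import pandas as pd

# Toy frame standing in for the real dataset (hypothetical values).
df = pd.DataFrame({"f1": [1.0, np.inf, 3.0], "f2": [-np.inf, np.nan, 2.0]})

# Count inf and NaN cells before the replacement.
n_inf = np.isinf(df.to_numpy()).sum()
n_nan = df.isna().sum().sum()

# Treat +/-inf exactly like NaN from here on.
df = df.replace([np.inf, -np.inf], np.nan)
```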

Print the number of NaNs for each feature, sorted in descending order.
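One way to get this table, again assuming a DataFrame `df` (column names here are illustrative):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"a": [np.nan, np.nan, 1.0],
                   "b": [1.0, 2.0, 3.0],
                   "c": [np.nan, 1.0, 2.0]})

# NaN count per feature, largest first.
nan_counts = df.isna().sum().sort_values(ascending=False)
print(nan_counts)
```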

We found several features with 1095 NaNs each. Since the dataset contains 1095 objects, these features are NaN for ALL objects!

Let's count these empty features.
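A sketch of counting (and, as promised below, later dropping) the fully empty columns:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"empty": [np.nan] * 3, "ok": [1.0, 2.0, np.nan]})

# A feature is "empty" when every one of its values is NaN.
empty_cols = df.columns[df.isna().all()]
n_empty = len(empty_cols)
df = df.drop(columns=empty_cols)
```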

We will remove them later

Let's plot the NaN distribution for the features that contain NaNs.

Box plot analysis

75% of the features in the sample contain 250 or fewer NaN values.

The median is 21 NaNs.

We could remove all features that contain more than 579 NaNs, but instead we will transform them into categorical features: 1 if the value is present and 0 if it is NaN.
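A sketch of that transformation, with the NaN threshold expressed as a share of rows (the 0.53 value mirrors the 579-of-1095 cutoff above; the column names are hypothetical):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"sparse": [np.nan, np.nan, np.nan, 1.0],
                   "dense": [1.0, 2.0, 3.0, 4.0]})
threshold = 0.53  # share of NaNs above which a feature becomes an indicator

nan_share = df.isna().mean()
for col in nan_share[nan_share > threshold].index:
    # 1 if the value is present, 0 if it was NaN
    df[col] = df[col].notna().astype(int)
```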

Let's test at which percentage of missing data the baseline RandomForest model shows the best result.
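A small sketch of such a sweep on synthetic data (the thresholds, median imputation, and data shapes here are assumptions for illustration; the real experiment would run on the actual dataset):

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.impute import SimpleImputer
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(0)
X = pd.DataFrame(rng.normal(size=(120, 15)))
y = (X[0] + rng.normal(scale=0.5, size=120) > 0).astype(int)

# Knock out a different share of values in each column.
for i, col in enumerate(X.columns):
    mask = rng.random(len(X)) < i / len(X.columns)
    X.loc[mask, col] = np.nan

scores = {}
for max_nan_share in (0.5, 0.7, 0.9):
    # Keep only features whose NaN share is below the threshold, then impute.
    keep = X.columns[X.isna().mean() <= max_nan_share]
    X_kept = SimpleImputer(strategy="median").fit_transform(X[keep])
    model = RandomForestClassifier(n_estimators=50, random_state=0)
    scores[max_nan_share] = cross_val_score(model, X_kept, y,
                                            cv=3, scoring="roc_auc").mean()
print(scores)
```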

We have shown in practice that the previously selected value (579 = 53% ≈ 50%) is close to the best. However, after this experiment we will keep features with up to 90% NaNs, because that shows better results. Further on, we will also use the 50 nearest neighbors to fill the NaNs in the data.
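The neighbor-based filling described here can be sketched with scikit-learn's `KNNImputer` (the data shape and missing rate below are illustrative):

```python
import numpy as np
from sklearn.impute import KNNImputer

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))
X[rng.random(X.shape) < 0.1] = np.nan  # ~10% of cells missing

# Fill each gap from its 50 nearest neighbors (falls back to the
# feature mean when fewer donors are available).
X_filled = KNNImputer(n_neighbors=50).fit_transform(X)
```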

Plot class distribution
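A sketch of this check, assuming the target lives in a Series `y` (the values here are hypothetical, chosen to mirror the 60/40 split noted below):

```python
import pandas as pd

y = pd.Series([1, 0, 1, 1, 0, 1, 0, 1, 0, 1])  # hypothetical target
counts = y.value_counts(normalize=True)
print(counts)
# counts.plot.bar() would draw the distribution in a notebook
```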

Classes are relatively balanced

Let's lock in this result by adding a function that returns the dataset cleared of NaNs.
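A sketch of such a function, chaining the steps above (inf replacement, dropping empty features, binarizing high-NaN features, KNN imputation); the name `clean_dataset` and the demo frame are my own:

```python
import numpy as np
import pandas as pd
from sklearn.impute import KNNImputer

def clean_dataset(df, max_nan_share=0.9, n_neighbors=50):
    """Return a copy of df with no NaN left, following the steps above."""
    df = df.replace([np.inf, -np.inf], np.nan)
    df = df.drop(columns=df.columns[df.isna().all()])       # empty features
    nan_share = df.isna().mean()
    for col in nan_share[nan_share > max_nan_share].index:  # presence flags
        df[col] = df[col].notna().astype(int)
    filled = KNNImputer(n_neighbors=min(n_neighbors, len(df) - 1)).fit_transform(df)
    return pd.DataFrame(filled, index=df.index, columns=df.columns)

demo = pd.DataFrame({
    "empty": [np.nan] * 6,
    "sparse": [np.nan, np.nan, np.nan, np.nan, np.nan, 1.0],
    "dense": [1.0, np.inf, 3.0, 4.0, 5.0, 6.0],
})
cleaned = clean_dataset(demo, max_nan_share=0.5, n_neighbors=3)
```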

Let's plot and try to describe our data. For this task we will use t-SNE. This method projects the data into a 2D plot so we can see the distances between classes (our initial data has 1543 dimensions).
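A sketch of the projection with scikit-learn's `TSNE` (random data stands in for the real features; perplexity is an assumed value and must stay below the sample count):

```python
import numpy as np
from sklearn.manifold import TSNE

rng = np.random.default_rng(0)
X = rng.normal(size=(80, 50))  # stand-in for the high-dimensional data

# Project to 2D for visual inspection of class separation.
emb = TSNE(n_components=2, perplexity=10, random_state=0).fit_transform(X)
# plt.scatter(emb[:, 0], emb[:, 1], c=y) would color the plot by class
```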

The plot shows areas where certain classes dominate, but the data is very difficult to separate with hyperplanes.

Perhaps some of the points are outliers with many missing values. Let's test this hypothesis by reducing the allowed share of missing values to 10%.

The result was almost unchanged: the data is still mixed. We leave the threshold at 579 ≈ 0.53.

Conclusion

1) We have a dataset with more features (1438) than objects (1095).

2) The data is strongly mixed.

3) We have 2 classes with a 60/40 distribution.

Model training

Let's try to find the model which describes our data best.

Let's test our default RandomForest baseline model.
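A minimal sketch of the baseline on synthetic data (the real run would use the cleaned dataset; `make_classification` here is just a stand-in):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the cleaned dataset.
X, y = make_classification(n_samples=300, n_features=30, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=0)

model = RandomForestClassifier(random_state=0)  # default hyper-parameters
model.fit(X_tr, y_tr)
auc = roc_auc_score(y_te, model.predict_proba(X_te)[:, 1])
print(f"ROC AUC: {auc:.3f}")
```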

Commit this baseline

Result on kaggle public test: 0.86341

We can improve the baseline result with fine-tuning.

Let's try to find similar tasks on the internet.

This kaggle competition is very similar to our challenge.

https://www.kaggle.com/c/lish-moa

I found two winners' solutions:

1) https://www.kaggle.com/kokitanisaka/moa-ensemble#ResNet

2) https://www.kaggle.com/kento1993/nn-svm-tabnet-xgb-with-pca-cnn-stacking-without-pp

They used a dense neural network, a CNN, and XGBoost. Let's try the same.
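The winning solutions build their dense networks in deep-learning frameworks; as a lightweight, plainly swapped-in stand-in, the same idea can be sketched with scikit-learn's `MLPClassifier` (layer sizes and data are assumptions, not the winners' settings):

```python
from sklearn.datasets import make_classification
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPClassifier
from sklearn.preprocessing import StandardScaler

X, y = make_classification(n_samples=400, n_features=40, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=0)

scaler = StandardScaler().fit(X_tr)
# Two hidden layers as a small dense network; scaling helps the MLP converge.
nn = MLPClassifier(hidden_layer_sizes=(128, 64), max_iter=300, random_state=0)
nn.fit(scaler.transform(X_tr), y_tr)
auc = roc_auc_score(y_te, nn.predict_proba(scaler.transform(X_te))[:, 1])
```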

Train NN

The model was trained in a Python script; I copied the code into the notebook and executed it there.

Results and additional research

Model: ROC AUC

1) XGB: 0.88201

2) RandomForest Classifier: 0.87439

3) Dense Neural Network: ~0.84

4) CNN (2D): ~0.82

The best result was shown by the XGB model. I had expected the CNN model to perform best; its poor result may be due to a misinterpretation of the data (stemming from a misunderstanding of the business process).

What can be improved:

1) Build a 1D CNN

2) Try another autoencoder architecture

3) Try stacking models.